Full-Text Access to Historical Newspapers

Authors

  • Tapas Kanungo
  • Robert B. Allen
Abstract

Newspapers are rich records of U.S. history. Because older newspapers are deteriorating, the National Endowment for the Humanities is archiving 19th-century newspapers on microfilm. Although microfilm is a good preservation method, it gives researchers and the general public only limited access. We are building a system to provide universal access to digital images and full-text content of historical newspapers. The system has three main components: (a) an Optical Character Recognition (OCR) module that converts digitized images into searchable text and identifies regions; (b) an Information Retrieval module that applies linguistic information to aid in segmentation, indexing, and retrieval of the noisy OCR’d text; and (c) a User Interface module that allows historians and educators to query and view retrieved documents. Thus far, we have developed two OCR techniques targeted at processing historical newspapers, and we have built a user interface that searches the OCR output and superimposes matches on a page image from the newspaper. This research was funded in part by the Department of Defense and the Army Research Laboratory under Contract MDA 9049-6C-1250.
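The search-and-superimpose step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how matches are found or drawn, so everything here is an assumption: `OcrWord`, `find_matches`, the sample page, and the 0.8 threshold are hypothetical, and approximate matching with Python's standard `difflib` stands in for whatever retrieval method the system actually uses. The idea is only that each OCR word carries a bounding box, so a match that tolerates OCR errors yields rectangles that a viewer could highlight on the page image.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Tuple


@dataclass(frozen=True)
class OcrWord:
    """One recognized word plus its location on the page (hypothetical type)."""
    text: str
    box: Tuple[int, int, int, int]  # (left, top, right, bottom) in page pixels


def find_matches(words: List[OcrWord], query: str, threshold: float = 0.8) -> List[OcrWord]:
    """Return OCR words whose text approximately matches the query.

    Similarity is difflib's ratio (0.0 to 1.0); the threshold is an
    illustrative choice, not a value taken from the paper.
    """
    q = query.lower()
    hits = []
    for word in words:
        score = SequenceMatcher(None, q, word.text.lower()).ratio()
        if score >= threshold:
            hits.append(word)
    return hits


# A made-up page fragment; the OCR engine misread "l" as "1" in "election".
page = [
    OcrWord("The", (10, 10, 40, 25)),
    OcrWord("e1ection", (45, 10, 110, 25)),
    OcrWord("results", (115, 10, 170, 25)),
]

# Boxes a user interface could draw over the page image.
boxes = [w.box for w in find_matches(page, "election")]
```

An exact string comparison would miss the misrecognized word in this fragment, which is why the sketch uses approximate matching; a real system would also need an index rather than a linear scan over every word.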

Similar resources

Improving Access to Digitized Historical Newspapers with Text Mining, Coordinated Models, and Formative User Interface Design

Most tools for accessing digitized historical newspapers emphasize relatively simple search; but, as increasing numbers of digitized historical newspapers and other historical resources become available, we can consider much richer modes of interaction with these collections. For instance, users might use exploratory search for looking at larger issues and events such as elections and campaigns...

The Architecture of TrueViz: A Groundtruth/Metadata Editing and Visualizing Toolkit

Tools for visualizing and creating groundtruth and metadata are crucial for document image analysis research. In this paper we describe TrueViz [LK00, KLCB01], which is a tool for visualizing and editing groundtruth/metadata. We first describe the groundtruthing task and the requirements for any interactive groundtruthing tool. Next we describe the system design of TrueViz and discuss how a user ...

A Framework for Text Processing and Supporting Access to Collections of Digitized Historical Newspapers

Large quantities of historical newspapers are being digitized and OCRd. We describe a framework for processing the OCRd text to identify articles and extract metadata for them. We describe the article schema and provide examples of features that facilitate automatic indexing of them. For this processing, we employ lexical semantics, structural models, and community content. Furthermore, we desc...

Delivering the Maori-Language Newspapers on the Internet

Although any collection of historical newspapers provides a particularly rich and valuable record of events and social and political commentary, the content tends to be difficult to access and extremely time-consuming to browse or search. The advent of digital libraries has meant that for electronically stored text, full-text searching is now a tool readily available for researchers, or indeed ...

Large-scale refinement of digital historic newspapers with named entity recognition

Within the Europeana Newspapers project (www.europeana-newspapers.eu), full-text will be produced for over 10 million pages of digitised historical newspapers by applying Optical Character Recognition (OCR) and Optical Layout Recognition (OLR). In order to further increase the usability of the full-text, Named Entity Recognition (NER) is also applied to materials in Dutch, German and French lan...

Publication date: 1999